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RODEO: SPARSE, GREEDY NONPARAMETRIC REGRESSION 

By John Lafferty^ and Larry Wasserman^ 
Carnegie Mellon University 

We present a greedy method for simultaneously performing lo- 
cal bandwidth selection and variable selection in nonparametric re- 
gression. The method starts with a local linear estimator with large 
bandwidths, and incrementally decreases the bandwidth of variables 
for which the gradient of the estimator with respect to bandwidth is 
large. The method — called rodeo (regularization of derivative expec- 
tation operator) — conducts a sequence of hypothesis tests to thresh- 
old derivatives, and is easy to implement. Under certain assumptions 
on the regression function and sampling density, it is shown that the 
rodeo applied to local linear smoothing avoids the curse of dimen- 
sionality, achieving near optimal minimax rates of convergence in the 
number of relevant variables, as if these variables were isolated in 
advance. 

1. Introduction. Estimating a higii-dimensional regression function is 
notoriously difficult due to the curse of dimensionality. Minimax theory pre- 
cisely characterizes the curse. Let 

(1.1) Yi = m{Xi) + Ei, i=l,...,n, 

where Xj = (^^(l), . . . , Xi{d)) G M'^ is a d-dimensional covariate, m : M"' ^ M 
is the unknown function to estimate and ^ N{0, a^). Then if m is in W2{c), 
the d-dimensional Sobolev ball of order two and radius c, it is well known 
that 

(1.2) liminfn^/(^+'^)inf sup TZ{fhn, m) > 0, 
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where 7?.(fn„,m) = Km J {fhnix) — m{x))'^ dx is the risk of the estimate m„ 
constructed from a sample of size n (Gyorfi et al. [12] and Stone et al. [25]). 
Thus, the best rate of convergence is which is impracticahy slow 

if d is large. 

However, for some applications it is reasonable to expect that the true 
function only depends on a small number of the total covariates. Suppose 
that m satisfies such a sparseness condition, so that 

(1.3) m[x) = m{x n) , 

where xr = {xj '.jGR), R C {I, . . . ,d} is a subset of the d covariates, of size 
r = <C d. We call {xj}j^R the relevant variables. Note that if an oracle 
were to identify and isolate the relevant variables, the better minimax rate 
of n~^/^'^~^^^ could be achieved, and this would be the fastest rate possible. 
Thus, we are faced with the problem of variable selection in nonparamet- 
ric regression. Our strategy is to seek a greedy method that incrementally 
searches through bandwidths in small steps. 

A large body of previous work has addressed this fundamental problem, 
which has led to a variety of methods to combat the curse of dimensionality. 
Many of these are based on very clever, though often heuristic techniques. 
For additive models of the form m{x) = ^jfnj{xj), standard methods like 
stepwise selection, Cp and AIC can be used (Hastie, Tibshirani and Fried- 
man [14]). For spline models, Zhang et al. [31] use likelihood basis pursuit, 
essentially the lasso adapted to the spline setting. CART (Breiman et al. 
[1]) and MARS (Friedman [8]) effectively perform variable selection as part 
of their function fitting. Support vector regression can be seen as creating 
a sparse representation using basis pursuit in a reproducing kernel Hilbert 
space (Girosi [11]). There is also a large literature on Bayesian methods, 
including methods for sparse Gaussian processes (Tipping [27], Smola and 
Bartlett [24], Lawrence, Seeger and Herbrich [17]); see George and McCul- 
loch [10] for a brief survey. More recently, Li, Cook and Nachsteim [19] use 
independence testing for variable selection and [2] introduced a boosting 
approach. While these methods have met with varying degrees of empirical 
success, they can be challenging to implement and demanding computation- 
ally. Moreover, these methods are typically very difficult to analyze theoret- 
ically, and so come with limited formal guarantees. Indeed, the theoretical 
analysis of sparse parametric estimators such as the lasso (Tibshirani [26]) 
is challenging, and only recently has significant progress been made on this 
front in the statistics and signal processing communities (Donoho [3], Fu 
and Knight [9], Tropp [28, 29], Fan and Peng [7] and Fan and Li [6]). 

In this paper, we present a new approach for sparse nonparametric func- 
tion estimation that is both computationally simple and amenable to the- 
oretical analysis. We call the general framework rodeo, for "regularization 
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of derivative expectation operator." It is based on the idea that bandwidth 
and variable selection can be simultaneously performed by computing the in- 
finitesimal change in a nonparametric estimator as a function of the smooth- 
ing parameters, and then thresholding these derivatives to get a sparse esti- 
mate. As a simple version of this principle, we use hard thresholding, effec- 
tively carrying out a sequence of hypothesis tests. A modified version that 
replaces testing with soft thresholding may be viewed as solving a sequence 
of lasso problems. The potential appeal of this approach is that it can be 
based on relatively simple and theoretically well-understood nonparametric 
techniques such as local linear smoothing, leading to methods that are sim- 
ple to implement and can be used in high-dimensional problems. Moreover, 
we show that they can achieve near optimal minimax rates of convergence, 
and therefore circumvent the curse of dimensionality when the true function 
is indeed sparse. When applied in one dimension, our method yields a lo- 
cal bandwidth selector and is similar to the estimators of Ruppert [21] and 
Lepski, Mammen and Spokoiny [18]. The method in Lepski, Mammen and 
Spokoiny [18] and its multivariate extension in Kerkyacharian, Lepski and 
Picard [16] yield estimators that are more refined than our method in the 
sense that their estimator is spatially adaptive over large classes of func- 
tion spaces. However, their method is not greedy: it involves searching over 
a large class of bandwidths. Our goal is to develop a greedy method that 
scales to high dimensions. 

Our method is related to the structural adaptation method of Hristache 
et al. [15] and Samarov, Spokoiny and Vial [23], which is designed for multi- 
index models. The general multi-index model is 



where x G M'^ and T is a linear orthonormal mapping from onto W~ with 
r <d. Variable selection corresponds to taking T to be a r by d matrix of O's 
and I's with each Tij = 1 if Xj is the ith relevant variable. Nonparametric 
variable selection can also be regarded as a special case of the partially linear 
model in Samarov, Spokoiny and Vial [23], which takes 



where x = {xi,X2)- Taking 9 to be zero yields the model in this paper. The 
advantage of structural adaptation is that it yields, under certain conditions, 
^/n estimates of the image of T in (1.4) and 6 in (1.5). However, structural 
adaptation does not yield optimal bandwidths or optimal estimates of the 
regression function, although this is not the intended goal of the method. 

In the following section we outline the basic rodeo approach, which is 
actually a general strategy that can be applied to a wide range of nonpara- 
metric estimators. We then specialize in Section 3 to the case of local linear 



(1.4) 



Y = go{Tx)+e, 



(1.5) 
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smoothing, since the asymptotic properties of this smoothing technique are 
fairly well understood. In particular, we build upon the analysis of Rup- 
pert and Wand [22] for local linear regression; a notable difference is that 
we allow the dimension to increase with sample size, which requires a more 
detailed analysis of the asymptotics. In Section 4 we present some simple 
examples of the rodeo, before proceeding to an analysis of its properties in 
Section 5. Our main theoretical result characterizes the asymptotic running 
time, selected bandwidths, and risk of the algorithm. Finally, in Section 6, we 
present further examples and discuss several extensions of the basic version 
of the rodeo considered in the earlier sections. The proofs of the theoretical 
properties of the rodeo are given in Section 7. 

2. Rodeo: the main idea. The key idea in our approach is as follows. Fix 
a point X and let fhh{x) denote an estimator of m{x) based on a vector of 
smoothing parameters h = (hi, . . . , h^i). If c is a scalar, then we write h = c 
to mean /i = (c, . . . , c) . 

Let M{h) =E(m/i(x)) denote the mean of fh}^{x). For now, assume that 
X = Xi is one of the observed data points and that iriQ{x) = 1^. In that case, 
m{x) = M(0) = E{Yi). If P = {h{t) : < t < 1) is a smooth path through the 
set of smoothing parameters with /i(0) = and h{l) = 1 (or any other fixed, 
large bandwidth) then 

(2.1a) m{x) = M(0) = M(l) + M(0) - M(l) 

(2.11.) ^Mil)-rJ^,s 

Jo ds 

(2.1c) = M(l) - f\D{h{s)),h{s)) ds, 

Jo 

where 

(2.2) „(,) = VMW =(-....,-) 

is the gradient of M{h) and h{s) = ^^j^^ is the derivative of h{s) along the 
path. An unbiased, low variance estimator of M(l) is mi{x). An unbiased 
estimator of D(h) is 

^^^^ _ f dfhhjx) dmh{x) Y 

The naive estimator 

(2.4) m{x)=mi{x)- f {Z {h{s)) , h{s)) ds 







is identically equal to 'mo{x) = Yi, which has poor risk since the variance of 
Z{h) is large for small h. However, our sparsity assumption on m suggests 
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Fig. 1. Conceptual illustration: The bandwidths for the relevant variables (hi) are 
shrunk, while the bandwidths for the irrelevant variables (7i2 ) are kept relatively large. 

that there should be paths for which D{h) is also sparse. Along such a path, 
we replace Z{h) with an estimator D{h) that makes use of the sparsity 
assumption. Our estimate of m{x) is then 



To implement this idea we need to do two things: (i) we need to find a path 
for which the derivative is sparse and (ii) we need to take advantage of this 
sparseness when estimating D along that path. 

The key observation is that if Xj is irrelevant, then we expect that chang- 
ing the bandwidth hj for that variable should cause only a small change 
in the estimator fhh{x). Conversely, if Xj is relevant, then we expect that 
changing the bandwidth hj for that variable should cause a large change 
in the estimator. Thus, Zj{h) = dfhh{x)/dhj should discriminate between 
relevant and irrelevant covariates. To simplify the procedure, we can replace 
the continuum of bandwidths in the interval with a discrete set where each 
hj ^ B = {ho, fdho, P'^ho, . . .} for some < /3 < 1. Moreover, we can proceed 
in a greedy fashion by estimating D{h) sequentially with hj £ B and setting 
Dj{h) = when hj < hj, where hj is the first h such that \Zj{h)\ < Xj{h) 
for some threshold Xj. This greedy version, coupled with the hard thresh- 
old estimator, yields m{x) = m-^(x). A conceptual illustration of the idea is 
shown in Figure 1. 

To further elucidate the idea, consider now the one-dimensional case x £ 
M, so that 



(2.5) 




(2.6) 
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Suppose that fhh{x) = ^27=1 ^i^i{x-> ^) is a hnear estimator, where the weights 
ii{x,h) depend on a bandwidth h. 
In this case 

n 

(2.7) Z{h) = Y,Y/,ix,h) 

i=l 

where the prime denotes differentiation with respect to h. Then we set 

(2.8) m{x)=mi{x)- f D{h)dh 

Jo 

where D{h) is an estimator of D[h). Now, 

(2.9) Z{h)^N{h{h),s^{h)) 

where, for typical smoothers, b{h) w Ah and s^(/i) ~ C /nh? for some con- 
stants A and C . Take the hard threshold estimator 

(2.10) D{h) = Z{h)I{\Z{h)\>\{h)), 

where \{h) is chosen to be slightly larger than s{h). An alternative is the 
soft-threshold estimator 

(2.11) D{h) = sign(Z(/i))(|Z(/i)| - \{h))^. 

The greedy algorithm, coupled with the hard threshold estimator, yields 
a bandwidth selection procedure based on testing. This approach to band- 
width selection is very similar to that of Lepski, Mammen and Spokoiny 
[18], who take 

(2.12) h = max{/i E H : 0(/i, ??) = for ah r/ < /i}, 

where (t){h^ rj) is a test for whether improves on ifih- This more refined test 
leads to estimators that achieve good spatial adaptation over large function 
classes. Kerkyacharian, Lepski and Picard [16] extend the idea to multiple 
dimensions. Our approach is also similar to a method of Ruppert [21] that 
uses a sequence of decreasing bandwidths and then estimates the optimal 
bandwidth by estimating the mean squared error as a function of bandwidth. 
Our greedy approach only tests whether an infinitesimal change in the band- 
width from its current setting leads to a significant change in the estimate, 
and is more easily extended to a practical method in higher dimensions. 

3. Rodeo using local linear regression. Now we present the multivariate 
rodeo in detail. We use local linear smoothing as the basic method since it 
is known to have many good properties. Let x = {x{l), . . . ,x{d)) be some 
target point at which we want to estimate m. Let fhH{x) denote the local 
linear estimator of m{x) using bandwidth matrix H. Thus, 

(3.1) mnix) = eJiX^W^X^r'X^W^Y = S^Y, 
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where ei = (1, 0, . . . , 0)"'", 

/I {X^-xf 

(3.2) X^= ■. : 

Vl {Xn-xf 

Wx is diagonal with element KniXi-x) andKniu) = \H\"^^'^ K{H~^/'^u). 
The estimator iriH can be written as 

n 

(3.3) fhHix)=Y,G{Xi,x,h)Y,, 

1=1 

where 

(3.4) G{u, X, h) = eJiX^WxX,)-^ _^^^t )kh{u- x) 

is called the effective kernel. One can regard local linear regression as a refine- 
ment of kernel regression where the effective kernel G adjusts for boundary 
bias and design bias; see Fan [5], Hastie and Loader [13] and Ruppert and 
Wand [22]. 

We assume that the covariates are random with density f{x) and that x is 
interior to the support of /. We make the same assumptions as Ruppert and 
Wand [22] in their analysis of the bias and variance of local linear regression. 
In particular: 

(i) The kernel K has compact support with zero odd moments and there 
exists V2 = i'2{K) 7^ such that 

(3.5) ( uu^K{u)du = U2{K)I, 



where / is the dx d identity matrix. 

(ii) The sampling density f{x) is continuously differentiable and strictly 
positive. 

In the version of the algorithm that follows, we take ii' to be a product 
kernel and H to be diagonal with elements h = (hi, . . . , hd) and we write fhh 
instead of ifiH- 

Our method is based on the statistic 

(3.6) z, = ^^^ = Y^G,{X,,x,h)Y,, 

where 

dG{u, X, h) 



(3.7) Gj{u,x,h) 



dhj 
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Rodeo: Hard thresholding version 



1. Select constant < /? < 1 and initial bandwidth 

(3.11) ho = -^. 

log log n 

2. Initialize the bandwidths, and activate all covariates: 

(a) hj = ho, j = 1,2, . . . ,d. 

(b) ^ = {1,2,. ..,4. 

3. While A is nonempty, do for each j G A: 

(a) Compute the estimated derivative expectation: Zj [equation (3.6)] 
and Sj [equation (3.9)]. 

(b) Compute the threshold Xj = Sj-\/21ogn. 

(c) If \Zj\ > Xj, then set hj <— jShj; otherwise remove j from A. 

4. Output bandwidths h* = (hi, . . . , h^j and estimator fh[x) = fhh*{x). 

Fig. 2. The hard thresholding version of the rodeo, which can be applied using the deriva- 
tives Zj of any nonparametric smoother. The algorithm stops when all derivatives are below 
threshold. As shown in the theoretical analysis, this happens after Tn = O(logn) iterations. 

Let 

n 

(3.8) fij = fijih) = E(Zj|Xi, ...,Xn) = Y^ G,iXi,x, h)m{Xi) 

i=l 

and 



(3.9) Yar{Zj\Xi , . . . , X„) = ^ {Xi,x, hf 

i=l 



In Section 4.3 we explain how to estimate a; for now, assume that a is 
known. The hard thresholding version of the rodeo algorithm is described 
in Figure 2. 

To derive an explicit expression for Zj, equivalently Gj, we use 

to get that 
(3.12a) Z, = 

= eJiX^WX)-^X^^Y 
dhj 
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(3-12b) 

- ef{x^wxy^x^—x{x^wx)-^x^wY 

ohj 

(3.12c) =eJ{X^WX)-'x''^{Y -Xa), 

dhj 

where a = {X"^WX)~^X^WY is the coefficient vector for the local hnear fit 
(and we have dropped the dependence on the local point x in the notation). 

Note that the factor = nf=i the kernel cancels in the expres- 

sion for in, and therefore we can ignore it in our calculation of Zj. Assuming 
a product kernel we have 



/ d d 
(3.13) W = diag n KiiXi, - x,)/h,), . . . , n K{{Xn, 

\7 = 1 .7 = 1 



Xj)/hj 



and dW /dhj = WLj, where 

'd\ogK{{Xij - Xj)/hj) d\ogK{{Xnj - X,)/hj) 



(3.14) Lj=diag 
and thus 



dhj ' ' " ' dhj 



Zj = eJ{X'^WX)~^X'^WLj{Y - Xa) 

(3.15) 

= ejBLj{I - XB)Y = Gj (x, hf^Y 

where B = [X^WX)-^X^W. 

The calculation of Lj is typically straightforward. As two examples, with 
the Gaussian kernel K{u) = exp(— u^/2) we have 



(3.16) Lj = diag((Xij - Xjf, ...,{X„ 



Xi 



3 



and for the Epanechnikov kernel K{u) = (5 — x'^)l{\x\ < \/5) we have 
(3,17a) L, = d„g(^^<|!i^^I(|^,, - .,1 < ^/5ft,). . . , 

4. Examples. In this section we illustrate the rodeo on some examples. 
We return to the examples later when we discuss estimating cr, as well as a 
global (nonlocal) version of the rodeo. 
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Fig. 3. Rodeo run on synthetic data sets, showing average bandwidths over 200 runs 
(left), final bandwidths with standard errors (right), and bandwidths on a single run 
of the algorithm (center). In the top plots the regression function is m{x) — 5x1x2 
with d — 10, n — 500 and a — 0.5 and in the lower plots the regression function is 
m(x) = 2{xi + 1)^ + 2sin(102;2), d = 20, n = 750 and a = 1. The figures show that the 
bandwidths for the relevant variables xi and X2 are shrunk, while the bandwidths for the 
irrelevant variables remain large. 

4.1. Two relevant variables. In the first example, we take m(x) = Sxfx^ 
with d = 10, a = 0.5 with Xi ~ Uiiiform(0, 1). The algorithm is applied to 
the local linear estimates around the test point xq = . . . , |), with /3 = 0.8. 
Figure 3 shows the bandwidths averaged over 200 runs of the rodeo, on data 
sets of size n = 750. The second example in Figure 4 shows the algorithm 
applied to the function m{x) = 2(xi + 1)'^ + 2sin(10x2), in this case in d = 20 
dimensions with a = 1. 

The plots demonstrate how the bandwidths hi and /i2 of the relevant 
variables are shrunk, while the bandwidths of the relevant variables tend to 
remain large. 
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Fig. 4. Squared error of the estimator on the previous examples, m{x) — ^x\x% (left) and 
m{x) — 2(xi + 1)^ + 2sin(10a;2) (right). For each plot, the left six boxplots show the risk 
in different dimensions ('d = 5, 10, 15, 20, 25, 30^ when using a single bandwidth, chosen by 
leave-one-out cross validation. The right six boxplots show the squared error on the same 
data with bandwidths selected using the rodeo. 



4.2. A one- dimensional example. Figure 5 illustrates the algorithm in 
one dimension. The underlying function in this case is m{x) = (1/x) sin(15/2;), 
and n = 1,500 data points are sampled as x ~ Uniform(0, 1) + ^. The algo- 
rithm is run at two test points; the function is more rapidly varying near the 
test point x = 0.67 than near the test point x = 1.3, and the rodeo appro- 
priately selects a smaller bandwidth at x = 0.67. The right plot of Figure 5 
displays boxplots for the logarithm of the final bandwidth in the base 1//3 
(equivalently, minus the number of steps in the algorithm) where (3 = 0.8, 
averaged over 50 randomly generated data sets. 

The figure illustrates how smaller bandwidths are selected where the func- 
tion is more rapidly varying. However, we do not claim that the method is 
adaptive over large classes of function spaces. As discussed earlier, the tech- 
nique is intentionally a greedy algorithm; adapting to unknown smoothness 
may require a more refined search over bandwidths that does not scale to 
large dimensions, and is out of the scope of the current paper. 

4.3. Estimating a . The algorithm requires that we insert an estimate a 
of a in (3.9). An estimator for a can be obtained by generalizing a method 
of Rice [20]. For i < ^, let 

(4.1) dit = \\Xi-X,\\. 
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Fig. 5. A one- dimensional example. The regression function is m{x) — (l/s) sin(15/2;), 
and n = 1,500 data points are sampled, x ~ Umform(0, 1) + ^. The left plot shows the 
local linear fit at two test points; the right plot shows the final log bandwidth, logj^yg h^, 
(equivalently, minus the number of steps) of the rodeo over 50 randomly generated data 
sets. 



Fix an integer J and let £ denote the set of pairs corresponding the J 
smahest values of d,/. Now define 



(4.2) 

Then 

(4.3) 

where 

(4.4) 

with D given by 
(4.5) 



a 



1 

2J 



E(a2) = + bias, 



bias < D sup E 



i=i 



dmj{x) 



dx-i 



D = max \\Xi — XA\. 



There is a bias-variance tradeoff: large J makes positively biased, and 
small J makes highly variable. Note, however, that the bias is mitigated 
by sparsity (small r). 
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Fig. 6. Rodeo run on the examples of Section 4-1, but now estimating the noise using 
the estimate a discussed in Section 4-3. Top: a = 0.5, d = 10; bottom: a = 1, d = 20. In 
higher dimensions the noise is over-estimated (center plots), which results in the irrelevant 
variables being more aggressively eliminated; compare Figure 3. 



A more robust estimate may result from taking 

(4.6) a = ^median{\Y,-Ye\}u^£ 

where the constant comes from observing that if Xi is close to Xi, then 

(4.7) \Yi-Ye\r^\N{0,2a^)\ = V2a\Z\, 

where Z is a standard normal with E|Z| = \/2pK. 

Now we redo the earlier examples, taking a as unknown. Figure 6 shows 
the result of running the algorithm on the examples of Section 4.1, however, 
now estimating the noise using estimate (4.6). For the higher-dimensional 
example, with d = 20, the noise variance is over-estimated, with the primary 
result that the irrelevant variables are more aggressively thresholded out; 
compare Figure 6 to Figure 3. 



14 



J. LAFFERTY AND L. WASSERMAN 



Although we do not pursue it in this paper, there is also the possibility 
of allowing cr{x) to be a function of x and estimating it locally. 

4.4. Computational cost. When based on a local linear estimator, each 
step of the rodeo algorithm has the same computational cost as constructing 
a single local linear fit. This is dominated by the cost of constructing the 
matrix inverse (X^WX)-'^ in equation (3.15). Since the derivative needs 
to be computed for every variable, the algorithm thus scales as 0{d^) in 
the dimension d. Implemented in R, the 20 dimensional example in Figure 3 
takes 4 hours, 4 minutes and 40 seconds for 200 runs, or 73.4 seconds per run, 
when executed on a 1.5 GHz PowerPC Macintosh laptop. Although we focus 
on local linear regression, it should be noted that very similar results are 
obtained with kernel regression, which requires no matrix inversion. Using 
kernel regression, the same example requires 12 minutes and 33 seconds, or 
3.7 seconds per run. 

5. Properties of the rodeo. We now give some results on the statistical 
properties of the hard thresholding version of the rodeo estimator. Formally, 
we use a triangular array approach so that m{x), f{x), d and r can all change 
as n changes, although we often suppress the dependence on n for notational 
clarity. We assume throughout that m has continuous third order derivatives 
in a neighborhood of x. For convenience of notation, we assume that the 
covariates are numbered such that the relevant variables Xj correspond to 
1 < i < and the irrelevant variables Xj correspond to r + \ < j < d. 

A key aspect of our analysis is that we allow the dimension d to increase 
with sample size n, and show that the algorithm achieves near optimal min- 
imax rates of convergence if d = 0(logn/loglogn). This hinges on a careful 
analysis of the asymptotic bias and variance of the estimated derivative Zj , 
taking the increasing dimension into account. We conjecture that, without 
further assumptions, d cannot increase at a significantly faster rate, while 
obtaining near optimal rates of convergence. 

The results are stated below, with the complete proofs given in Section 7. 

Our main theoretical result characterizes the asymptotic running time, 
selected bandwidths and risk of the algorithm. In order to get a practical 
algorithm, we need to make assumptions on the functions m and /. 

(Al) The density f{x) of (Xi, . . . is uniform on the unit cube. 



(A3) All derivatives of m up to and including fourth order are bounded. 

Assumption (Al) greatly simplifies the proofs. If we drop (Al), it is nec- 
essary to use a smaller starting bandwidth. 



(A2) 





n— >oo l<j<r 



GREEDY REGRESSION 



15 



Theorem 5.1. Assume the conditions of Lemma 7.1 and suppose that 
assumptions (Al) and (A2) hold. In addition, suppose that 



and 



^min = min |mjj(x)| =0(1) 



^max = ma.x\mjj{x)\ = 0(1). 

3<r 



Then the rodeo outputs bandwidths h* that satisfy 

(5.2) F{h* = ho for all j > r) ^ 1 
and for every e > 0, 

(5.3) P(n-^/(^+'^)-^ < h* < n-^/(^+'^)+^ for all j<r)^l. 

Let Tn be the stopping time of the algorithm. Then P(t_L < tjj) — > 1 
where 

(5-4) tL = , , log 



mm 



(r + 4) log(l/;3) ^ V 8C2 log n (log log nY)' 



1 , / nAi 



^max 



^^'^^ (r + 4)log(l/;3)^°^VaC2logn(loglogn^'^ 

and < a < 2. 



Corollary 5.2. Under the conditions of Theorem 5.1, 
(5.6) (fhh* (x) - m{x)f = Op(?i-^/(4+^)+^) 

for every e > 0. 



6. Extensions and variations of the rodeo. The rodeo represents a gen- 
eral strategy for nonpar ametric estimation, based on the idea of regularizing 
or testing the derivatives of an estimator with respect to smoothing param- 
eters. There are many ways in which this basic strategy can be realized. 
In this section we discuss several variants of the basic hard thresholding 
version of the rodeo, including a soft thresholding version, a global rather 
than local bandwidth selection procedure, the use of testing and generalized 
cross validation, and connections to least angle regression. Further numerical 
examples are also given to illustrate these ideas. 
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Rodeo: Soft thresholding version 



1. Select parameter < (3 <1 and initial bandwidth Hq. 

2. Initialize the bandwidths, and activate all covariates: 

(a) hj = ho, j = l,2,...,d. 

(b) ^ = {1,2,. 

(c) Initialize step, t=l. 

3. While A is nonempty 

(a) Set dhj {t) = 0, j = I, . . . ,d. 

(b) Do for each j £ A: 

(1) Compute the estimated derivative expectation Zj and sj. 

(2) Compute the threshold Xj = Sj\/21ogn. 

(3) If \Zj\ > Xj, set dhj{t) = (1 — f3)hj and hj <— (3hj; 
otherwise remove j from A. 

(4) Set D,{t)=sign{Zjih)){\Z,ih)\ - A,)+. 

(c) Increment step, t <— f + 1. 

4. Output bandwidths h* = (hi, . . . , hd) and estimator 

t 

(6.4) m{x) = mh,{x)-Y,{b{s),dh{s)) 

s=l 



Fig. 7. The soft thresholding version of the rodeo. 

6.1. Subtracting off a linear lasso. Local linear regression is a nonpara- 
metric method that contains linear regression as a special case when oo. 
If the true function is linear but only a subset of the variables are relevant, 
then the rodeo will fail to separate the relevant and irrelevant variables 
since relevance is defined in terms of departures from the limiting paramet- 
ric model. Indeed, the results depend on the Hessian of m which is zero in 
the linear case. The rodeo may return a full linear fit with all variables. A 
simple modification can potentially fix this problem. First, do linear variable 
selection using, say, the lasso (Tibshirani [26]). Then run the rodeo on the 
residuals from that fit, but using all of the variables. An example of this 
procedure is given below in Section 6.4. 

6.2. Other estimators and other paths. We have taken the estimate 

(6.1) Dj{h)=Z,{h)Ii\Zjih)\>X,) 
with the result that 

(6.2) m{x) = mhf,{x) - [ {D{s),h{s)) ds = mh*{x). 

Jo 
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There are many possible generalizations. First, we can replace D with the 
soft-thresholded estimate 

(6.3) D,{t)=sign{Z,{h)){\Z,{h)\-\j)^ 

where the index t denotes the ith step of the algorithm. Since hj is updated 
multiplicatively as hj <^ phj, the differential dhj{t) is given by dhj{t) = 
(1 — P)hj. Using the resulting estimate of D{t) and finite difference approx- 
imation for h{t) leads to the algorithm detailed in Figure 7. 

Figure 8 shows a comparison of the hard and soft thresholding versions 
of the rodeo on the example function m{x) = 2{xi + 1)^ + 2sin(10x2) in 
d = 10 dimensions with a = 1; P was set to 0.9. For each of 100 randomly 
generated datasets, a random test point x ~ Uniform(0, 1)"^ was generated, 
and the difference in losses was computed: 

(6.5) {fhha.rd{x) - m{x)f - {fhsoit{x) - m{x)f. 

Thus, positive values indicate an advantage for soft thresholding, which is 
seen to be slightly more robust on this example. 

Another natural extension would be to consider more general paths than 
paths that are restricted to be parallel to the axes. We leave this direction 
to future work. 

6.3. Global rodeo. We have focused on estimation of m locally at a point 
X. The idea can be extended to carry out global bandwidth and variable 
selection by avGraging over multiple cvctluatioii points x\j • • • ? X]^. These could 
be points of interest for estimation, could be randomly chosen, or could be 
taken to be identical to the observed XiS. 

Averaging the Zj^s directly leads to a statistic whose mean for relevant 
variables is asymptotically k~^hjY^\=i'nijj{xi). Because of sign changes in 
mjj{x), cancellations can occur resulting in a small value of the statistic for 
relevant variables. To eliminate the sign cancellation, we square the statis- 
tic. Another way of deriving a global method would be to use the statistic 
sup^|Z*(x)|. 

Let xi, . . . ,Xf^ denote the evaluation points. Let 

n 

(6.6) Z,ix,) = J2YsGjiXs,Xi). 

s=l 

Then define the statistic 

(6.7) T,^lj2z'i^^) = lY^P^Y, 

i=l 

where Pj = GjGj, with Gj{s,i) = Gj{Xs,Xi). 
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Fig. 8. Comparison of hard and soft thresholding. Left: m{x) = ^x\x\, d = 10 and 
a = 0.5; right: m{x) — 2{xi + 1)'^ + 2sin(10a:2), d = 10 and a — 1. The hard and soft 
thresholding versions of the rodeo were compared on 100 randomly generated data sets, 
with a single random test point x chosen for each; (3 = 0.9. The plots show two views of 
the difference of losses, (7Tihard(a:) — m(x))^ — (msoft(a;) — m{x))^ ; positive values indicate 
an advantage for soft thresholding. 



If j G then we have E(Zj(xj)) = o(l), so it follows that, conditionally, 

2 

(6.8a) E(T,) = ^tr(P,) + op(l), 

(6.8b) Var(T,) = _tr(P,P,) + op(l). 

We take the threshold to be 

(6.9) A, = ^ tr(P,) + 2^y^tr(P,P,)log(n). 
Note that if j > r, we have 

(6.10) E{Tj) = ^Y.''M^) + Oihl) 

i 

but for j <r we have 

(6.11) E{T,) = ^Y.^'M^)+0{hl). 

i 

We give an example of this algorithm in the following section, leaving the 
detailed analysis of the asymptotics of this estimator to future work. 
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6.4. Greedier rodeo and LARS. The rodeo is related to least angle re- 
gression (LARS) (Efron et al. [4]). In forward stagewise linear regression, one 
performs variable selection incrementally. LARS gives a refinement where at 
each step in the algorithm, one adds the covariate most correlated with the 
residuals of the cm^rent fit, in small, incremental steps. LARS takes steps of 
a particular size: the smallest step that makes the largest correlation equal 
to the next-largest correlation. Efron et al. [4] show that the lasso can be 
obtained by a simple modification of LARS. 

The rodeo can be seen as a nonparametric version of forward stagewise 
regression. Note first that Zj is essentially the correlation between the l^'s 
and the Gj{Xi,x, h)s (the change in the effective kernel). Reducing the band- 
width is like adding in more of that variable. Suppose now that we make 
the following modifications to the rodeo: (i) change the bandwidths one at 
a time, based on the largest Zj = Zj/Xj, (ii) reduce the bandwidth continu- 
ously, rather than in discrete steps, until the largest Zj is equal to the next 
largest. Some examples of the greedy version of this algorithm follow. 

6.4.1. Diabetes example. Figure 9 shows the result of running the greedy 
version of the rodeo on the diabetes dataset used by [4] to illustrate LARS. 
The algorithm averages ZJ over a randomly chosen set of A; = 100 data 
points, and reduces the bandwidth for the variable with the largest value; 
note that no estimate of a is required. The resulting variable ordering is 
seen to be very similar to, but different from, the ordering obtained from the 
parametric LARS fit. The variables were selected in the order 3 (body mass 
index), 9 (serum), 7 (serum), 4 (blood pressure), 1 (age), 2 (sex), 8 (serum), 
5 (serum), 10 (serum), 6 (serum). The LARS algorithm adds variables in 
the order 3, 9, 4, 7, 2, 10, 5, 8, 6, 1. One notable difference is in the position 
of the age variable. 

6.4.2. Turlach^s example. In the discussion to the LARS paper, Berwin 
Turlach [30] gives an interesting example of where LARS and the lasso fails. 
The function is 

(6.12) Y = {Xi-^f + X2 + X3 + X4 + X5+e 

with ten variables Xi ~ Uniform(0, 1) and a = 0.05. Although Xi is a rele- 
vant variable, it is uncorrelated with Y, and LARS and the lasso miss it. 

Figure 10 shows the greedy algorithm on this example, where bandwidth 
corresponding to the largest average Z* is reduced in each step. We use 
kernel regression rather than local linear regression as the underlying esti- 
mator, without first subtracting off a Lasso fit. The variables X2, X3, X4, X5 
are linear in the model, but are selected first in every run. Variable xi is 
selected fifth in 72 of the 100 runs; a typical run of the algorithm is shown 
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Fig. 9. Greedy rodeo on the diabetes data, used to illustrate LARS (Efron et al. [4])- A 
set of k = 100 of the total n = 442 points were sampled (d = 10^, and the bandwidth for 
the variable with largest average \Zj\/\j was reduced in each step. 

in the left plot. In contrast, as discussed in Turlach [30], LARS selects xi in 
position 5 about 25% of the time. 

Figure 11 shows bandwidth traces for this example using the global al- 
gorithm described in Section 6.3 with k = 20 evaluation points randomly 
subselected from the data, and a taken to be known. Before starting the 
rodeo, we subtract off a linear least squares fit, and run the rodeo on the 
residuals. The first plot shows /ii, . . . , /15. The lowest line is hi which shrinks 
the most since m is a nonlinear function of xi. The other curves are the lin- 
ear effects. The right plot shows the traces for /ig, . . . , /iio, the bandwidths 
for the irrelevant variables. 

7. Proofs of technical results. In this section we give the proofs of the 
results stated in Section 5. We begin with three lemmas. 

We write Yn = Op{an) to mean that Yn = Op{bnan) where 6„ is logarith- 
mic in n. As noted earlier, we write = ^{hn) if liminf„ |^| > 0; similarly 

an = ^{hn) if an = ^^(&nCn) where Cn is logarithmic in n. 



Let 




denote the Hessian of m{x). For a given bandwidth h = (/ii, . . . , /i^), de- 
note the bandwidth matrix by = diag(/if , . . . , /i^). Similarly, let Hr = 






^1 








g- 




3- 


E 






S- 







5 6 7 e 9 10 

Fig. 10. Top; j4 typical run of the greedy algorithm on Turlach's example. The bandwidths 
are first reduced for variables X2,X3,X4,X5, and then the relevant, but uncorrelated with Y 
variable xi is added to the model; the irrelevant variables enter the model last. Bottom: 
Histogram of the position at which variable xi is selected, over 100 runs of the algorithm. 




diag(/if , . . . , /i^). Define 

(7.1) Hj{h) = -^Klmnix) - mix)\X,,. . . ,X„], 

which is the derivative of the conditional bias. The first lemma analyzes 
fj,j{h) and E(/ij(/i)) under the assumption that / is uniform. The second 
lemma analyzes the variance. The third lemma bounds the probabilities 
lP(|-^j| ^ Aj) in terms of tail inequalities for standard normal variables. 

In each of these lemmas, we make the following assumptions. We assume 
that / is uniform, K is a product kernel, and < /? < 1. Moreover, we make 
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Rodeo SIsp 



Rodeo Stop 



Fig. 11. The global rodeo averaged over 10 runs on Turlach^s example. The left plot 
shows the bandwidths for the five relevant variables. Since the linear effects (variables two 
through five) have been subtracted off, bandwidths /12 , /13 , /i4 , /15 are not shrunk. The right 
plot shows the bandwidths for the other, irrelevant, variables. 



use of the following set B of bandwidths 

B = l^h = {hi,...,hd) = iP^'ho, fi^'^hoM, ...,ho): 

(7.2) 



r terms 



d—r terms 



0<%<r„, j = l,...,r|, 

where r„ < cilogn. Finally, we assume that 

(7.3) r = 0{l), 

(7.4) d = Oi 

(7.5) ho = - — :^ for some cq > 0. 



log?! 
log log n 
Co 



log log n 

Lemma 7.1. For each h£ B, 

i^2'injj{x)hj + gj{xR, hR)hj, 



(7.6) nnW) 



0, 



j < r, 
j > r, 
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where V2 is defined in equation (3.5), and gj{xji,hR) depends only on the 
relevant variables and bandwidths, and satisfies 



(7.7) \9jixR,hn)\ =0 ^sup|mj.,fcfc(3;)|/i; 



Furthermore, for any 6 > 0, 



I 

k<r 



\ heB Sj{h) log log n/ n^"" 

l<j<d 



where 



C Jr 1 



with 

(7.10) C = a^ J K^{u)du/f{x). 

Remark 7.2. If we drop the assumption that f{x) = 1 then the mean 
becomes 

(7.11) 

V2mjj{x)hj + o{hj), j<r, 
-ii{HnnR)4{Vj log f{x)fhj + o{hj tv{HR)), j > r. 

Proof of Lemma 7.1. We follow the setup of Ruppert and Wand [22] 
but the calculations need to be uniform over h£ B and we have to allow for 
increasing dimension d. 

Note that there are Nn = {Tn + l)** elements in the set B. Fix an h & B. 
Then hj = Hq for j > r and 

(7.12) P^"ho < hj <ho, 1 < j < r. 



Let 

"Hr 




denote the Hessian of m{x). Let Vm be the gradient of m at x, and let 

(7.13) Q = ((Xi - xfniXi -x),...,{Xn- xfmXn - x)f. 
Note that Vm and Q are only functions of the relevant variables. Then 

(7.14) m{Xi) = m{x) + {Xi - xfVm + \Qi + Ri 



24 J. LAFFERTY AND L. WASSERMAN 

where, using multi-index notation, 



Ri = l i^i - t D''m{{l - s)x + sXi) ds 



(7.15) 



|a|=3 

J2 {x,-xrR^{Xi) 

l"l=3 



for functions Ra that only depend on the relevant variables and satisfy 
(7.16) |i?,(X,)| <isup|Z)"7n(x)|. 



Thus, with M = (m(Xi), . . . , m(X„))'^, 



(7.17) 



m{x) 



+ ^Q + R, 



where R= (i?i, . . . , i?„)^. Since SxXx{m{x),'Vm)'^ =m(x), the conditional 
bias 



(7.18) 

is given by 



bnix) = E{mHix)\Xi, . . . ,Xn) -m{x) 



(7.19a) bn{x) = SxM -m{x) = -SxQ + SxR 

(7.19b) = UjiX'^W^X^r^X'^W^Q + eJiX'^W^X^r^X^W^R 



(7.19c) 



-elT-^Vn + eiT-'-X^W^R, 



where T„ = n-^iX^W^X^) and r„ = n-^{X]:WxQ). 
Analysis ofTn- We write 



(7.20a) T„ 

(7.20b) 

where 

(7.21) 



/ 1 " 

n 

1=1 

1 " 



1 



n 



-Y^Wi{Xi-x) -Y^Wi{Xi-x){Xi 

^ i=i i=i 

An Au 
A21 A22 



j=i V 



■.X{Kh^{x,-X,^). 
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Now, the variance of Wi can be bounded as 

d 



(7.22) Var(T^,) < E{W^) = llv^I[ (^h;^) 

j=i % ij=i "-i ^ "-i ^ 

(7.23) = n IT / ^^(^)/(^ + ^^^^^) 
7.25 =^s]^A 

since / = 1, where C is as defined in (7.10). Therefore, 
(7.26) Var(Aii)<^. 
Also, Wi < A. Hence, by Bernstein's inequality, 

(7.27a) n\Au - E(^u)l > W) < 2«p|-- (^^^-iLL_) } 



(7.27b) =2exp(---2-^ 



2^2 



I 2/i2(l + es 



Now taking e = "j^^^^j and using the definition of /iq = co/loglogn, this 
gives 

(7.28) F(|An-E(^n)|>..(M^) <2exp|-^!^| 

V loglogn/ L 4co J 

(7.29) = 2n-^'''/(^^») 
and so with this choice of e, 

Pfsup '-^"-^(-^"^1 > s,(/.)^^) < 2(r„ + l)^^n--^V(4co) 
VfeeB Sj{h) loglogn/ 

(7.30) 

<^-<x25/(8co)_ 

Also, since /(x) = 1, E(An) = J j^^^^K{H~y^{x - u))f(u)du = f{x). 
Hence, for any e > 0, 

(7.31) pfsup^ill^^>.Uo. 



so{h) 
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Next consider ^21- Now E(^2i) = V2{K)HD where D is the gradient 
of /. Thus, in the uniform / case, E(j42i) = E(Ai2) = 0, and by a similar 
argument as above, P(sup^gg | A12I/S0 > e) — > 0. Turning to ^422, we again 
have convergence to its mean and £(^22) = y2f{x)H. 

Thus, 



(7.32) 
Thus, if 

(7.33) 

then 
(7.34) 



E(T„) 



fix) 
y2f{x)H 







1^2 



fix)/ 



(^, J^nHj,k)-T-\j,k))\ 
So 



1 vT 



> e 



0. 



Analysis of r„ = j^X^W^Q. We can write 



Tn = -X^W,Q: 
n 



(7.35) 



/ 1 " 

' -y^Wi{x,-x)^n{Xi-x) 
1 " 

- V(x, - x)Wi{x, - x)^n{Xi - 



Now, 

(7.36a) E(7i: 

(7.36b) 

(7.36c) 
(7.36d) 



K{v){H^/^v)^n{H^/\)f{x + H^/'^v) dv 

fix) I K{v){H^/\)^n{H^/\)dv 

K{v){H^'\)^n{H^'\)D^{H^I\) dv 

+ i j K{v){H^/\)^n{H^'\){H^'\)^D^{H^I\)dv 
U2f{x)tv{HnR). 



The stochastic analysis of 71 from its mean is similar to the analysis of A12 
and we have I71 — i'2fix)tr{HTCji)\/so{h) = o(l) uniformly. 
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Next, 



(7.37a) E(72)= jK{v){H^/^v){H^^^v)'^n{H^/\)f{x + H^/\)dv 
= /(x) J {H^/^v)K{v){H^/^vfn{H^/\)dv 

+ J K{v){H^/^v){H^'\)^'H{H^'\)D'^{H^I\)dv 
(7.37b) +\ j K{v){H^/^v){H^''^vf 

X n{H^'\){H^'\)^Di{H^/\)dv 
K{v){H^'\){H^'\)^n{H^'\)D'^{H^'\)dv 
(7.37c) +1 J K{v){H^^\){H^/\f 

X n{H^'^v){H^'\)^Di{H^/\)dv 



Thus, 
(7.38) 



iE(r„) 



(u2f{x)tT{HnR] 

\ 



Analysis of remainder ejT ^^^X^WxR- We can write 



(7.39) 



-X'^W:,R-- 
n 



I , n 

n 

1=1 

1 " 

-y^{Xi-x)WiR^ 



Then we have, using the definition of Ra in (7.15), 
(7.40) E(5i) = j K{v) J2 Ra{x + H^''^v){H^/'^v)^f{x + H^I'^v)dv 



|q|=3 



(7.41) =f{x)Y, f K{v)Ra{x + H^/\){H^/\)''dv. 

|a|=3 

Due to the fact that / K{v)v^ dv = for a of odd order, we can expand 
more order to obtain 

(7.42) E{6i) = f{x)Y, j K{v)Ra,{x + H^''^v){H^l^vTdv 
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(7.43) =fix)Y.R^{x,h)h'' 

|a|=4 

for functions Ra which satisfy 

(7.44) \R^{xR,hR)\=o(snp\D"m{x) 
Thus, we have that 



X 



(7.45) E{5i) = 01 ^ swp\mjjkk{x)\hyA. 

\j,k<.r / 

Similarly, 

(7.46a) E(52) =o( ^ sup \mj kk{x)\h^j hU . 

Hence, 

(7.47) ejT-^hilW^R = Op{Y. 

\j,k<r / 

Putting all of this together, we conclude that 

(7.48) E6„(x) = tTiHHa) + g{x, h), 
where 



(7.49) <?(x,/i) = o( ^ sup 

\7,fc<r 



7njjkk{x)\hjhl . 



Taking the derivative with respect to bandwidth hj, for j <r, we obtain 

(7.50) E/ij(/i) = V2mjj{x)hj + gj{x, h)hj, 

where 



(7.51) 5(j(x,/i) = Vsupl 

\k<r ^ 



mjjkkix)\hl . 



The probability bounds established with Bernstein's inequality then give the 
statement of the lemma. □ 

Remark 7.3. Special treatment is needed if x is a boundary point; see 
Theorem 2.2 of Ruppert and Wand [22]. 
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Lemma 7.4. Let Vj{h) =Var{Zj{h)\Xi, . . . ,Xn). Then 



(7.52) 



max 

l<j<d 



1 



for all e > 0. 

Proof. Let i denote the first row of Sx- Then, with ^ N{0,a'^), 
(7.53a) mnix) = ^ m = ^ iim{Xi) + ^ i^Si 

i i i 

(7.53b) Aj2i,m{X,) + f^, 



(7.53c) 

where 
(7.54) 

Thus, 
(7.55) 



Y,^MXi) + 



A 



y/nhi ■■■hd 



A= nhi---hdJ2' 



Var(Zj(t)|Xi,...,X„) =(j2Var 



d A 
dhj y/nhi ■ ■ ■ ha 



Now we find an asymptotic approximation for A. 
Recall that 



(7.56) 



n 



n 



and from our previous calculations 

(7.57) T~^ = (^XjWxXy= -^("^ 
\n ^ 



S,,= {-X^WxXx] -X'^Wx 





(l + op(l))- 



V2 



Note that ^^^^ ^j^g g^try of S'^Sj. But 



(7.58a) SxS^ = ( T-^-X,^ VF,. ) ( T'^-X^W, 



n 



(7.58b) 
(7.58c) 



\t~^x'^wIXx1:-^ 



n 



I 1 
1 



n ^ * 



n 



n 



Y^{X, - x)Wf - Y,{Xi - x){Xi - x)^Wf 
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So is the (1,1) entry of 



/ hi---hd 



T 



-1 



n 



hi---hd 



n 



n 



hi---hd 



Y,{X,-x){X,-xfW^ 



n 



T 



-1 



1 / Oil 0,12 



T 



(7.59) = 

V021 122, 

Next, as in our earlier analysis, 



-1 



E(aii)= j K'^{v)f{x-H^l^v)dv 
= f{x) f K^{v)dv 



(7.60a) 
(7.60b) 

and similarly, E(a2i) = E(a22) = and E(a22) = f{x)i'2H, where 1/2 = / vv"^ x 
K^{v)dv. Hence, the leading order expansion of A2 is given by 



(7.61) 



fix) 



+ 0{tiiH)). 



Taking the derivative with respect to hj we thus conclude that 

a^jK\v)dv 1 



(7.62) Var(Zj(t)|Xi,...,X„) 



fix)h'^- nhi---hd 



l + op(l)). 



which gives the statement of the lemma. □ 

Lemma 7.5. 
1. For any c > and each j > r, 



(7.63) 



P(|Z,-(/io)| >Aj(/io)) = o 



1 



n"- 



2. Uniformly for h£ B we have the following: for any c> 0, j <r, 

l'2\mjj{x)\hj + Zn\ /I 



(7.64) Pi\Zj{h)\ < Xj{h)) < P( A^(0, 1) > 
where Zn = 0{h^). 



Sjih) 



+ 



Proof. Proof of (1). Fix (5 > 0,c > 0. By the previous lemmas, there 
exists a sequence of sets Vn and sequences of constants Ci,n,'^2,n such that 
Ci,n < V^logTI/loglogn, 6,n ^ 0, ¥{V^) = 0(n~'5^'/(S'=o)). On K we have 
that 



(7.65) 



l/"i(^o)|/Sj(/lo) <il,r 
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and 

(7.66) \Sj{h)/^V,{h^)-l\<i2,n. 

The events Vn depend on the Xj's but not e^'s. Choosing 5 large enough we 
have that IP(V^) = o{n~'^). So, for ah j > r, 

(7.67a) P(Z,(/io)>A,(/io)) 



(7.67b) 
(7.67c) 
(7.67d) 
(7.67e) 
(7.67f) 

(7.67g) 
(7.67h) 



Zj(/io) - fij{ho) ^ \j{ho) - Hj{ho) 



Zj{ho) - fijjho) ^ Xjjho) - lijiho) 



P(^iV(0,l) > 
iV(0,l)> 



Xj{ho) - fij{ho) 



Vj{ho) 



X 



1, ■ ■ ■ , Xn 



Vj{ho) 



Sj{ho) 



Sj{ho) 



F{N{0, 1) > v/2b^(l - C2,n) - 6,n(l + 6,n)) + O 



n' 



d 



and the result follows from the normal tail inequality and the fact that 
V 2 log n — ^i^n > V(2 — 7) logn for any 7 > 0. 

Proof of (2). By the previous lemmas, there exists a sequence of sets 
Vn and sequences of constants Ci,n,(,2,n such that < \/71ogn/ log logn, 
S.2,n 0, IP(1^^) = o(l/n'^) for any c > and on Vn we have that \nj{h) — 
V2mjj{x)hj + 0{hj)\/sj{h) < ^i^n and \sj{h)/^/vJ{h) — 1| < ^2,n- The events 
Vn depend on the Xj's but not Sj's. Without loss of generality assume that 
mjj{x) > 0. Then 

(7.68a) F{\Zj{h)\ < Xj{h)) 

Xj{h)-fij{h) 



(7.68b) 
(7.68c) 



<N{0,1) < 



< 



-00 < 



Vj{h) 

iV(0,l)<M4^M^) 
v,{h) J 



Vj{h) 
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(7.68d) =F(N{0,l)>t^^zM^) 

(7.68e) =fUo,1)>^^^^^M^4^) 



(7.68g) 



□ 



7.1. Proof of Theorem 5.1. 



Proof of Theorem 5.1. Let At be the active set at step t. Define At to 
be the event that = {1, . . . , r}. Let Ct = {At = 0}. Recall the definitions 
of ti and tjj from equations (5.4) and (5.5). We will show that 

(7.69) p(^Ct^n(^n^,^^^i 

from which the theorem follows. We analyze the algorithm as it progresses 
through steps 1, . . . , t, . . . T„. Fix c > 0. In what follows, we let ^,n{c) denote a 
term that is o{n~'^); we will suppress the dependence on c and simply write 

in. 

Step t = \. Define the event 

(7.70) Bi = {\Zj\> Xj for all j<r}n{\Zj\< Xj for all j > r} . 
Thus, Ai = Bi. We claim that 

(7.71) r{Bt)<- + in. 

n 

To show (7.71), we proceed as follows. First consider j > r. From Lemma 
7.5, 

(7.72) P f max \Zj\> A^) < V F{\Zj\>Xj)< d^n = in- 



Now consider j < r. Note that fj!j (h) / sj (h) > Slogn and hence, from Lemma 
7.5, 

(7.73) ¥{\Zj I < Xj for some j <r)<o(-) + in- 



n 



This proves (7.71). 
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Step t = 2. Let h = {hi, . . . , hd) be the random bandwidth at step t = 2. 
Let 

(7.74) K = { pho,. fiho , hp,..., hp). 

r terms d-r terms 

Let B2 = {\Zi \ > Xi, . . . ,\Zr\ > Xr}- Then A2 = Bi f] B2 and h = h^, on A2- 
Now, P(^^) < P(5f) +P(Sf) and 

(7.75) W{B^) = ¥{B^, Ai) + F{B^, Al) < F{B^, Ai) + ¥{Al) 

(7.76) =¥{BI,Ai) + - + ^n 

n 

(7.77) =¥(mS^n\Zj(h)\<Xj,A^+-+in 

\j<r J n 

(7.78) =pfmin|Zj(/i*)| < Aj,Ai) + ^ + 

\3<r J n 

(7.79) < pfmin|Zj(/i,)| < A.) + - + 

\j<r / n 

(7.80) <i + ^„+('l + ^^ 
where the last step foUows from the same argument as in step 1. So, 

(7.81) P(^^) < F{A\) + P(SS) < 2P(^?) + - + 2^„. 

n 

S'tep i /or t <tL- Let /i = (/ii, . . . , hd) be the random bandwidth at step 
t. Let 

(7.82) K = iP'-^hp, . . . , /3*-^/io, /lo, .■.,hp). 

^ V ' " V ' 

r terms d-r terms 

Let Bt = {\Zi\ > Al, . . . , \Zr\ > Xr}. Then = flLi^^ and h = h^ on At. 
Now, P(A^) < ELi1P(-S,') and 

(7.83) FiB^)=F{B^,At^i)+F{B^,AU) 

(7.84) <P(i?f,ylt_i)+P(A?_i) 

(7.85) =¥{BlAt-i) + - + Cn 

n 

(7.86) =pfmin|Zj(/i)| <Aj,^t_i) +- + Cn 

(7.87) =pfmin|Zj(/i,)| < Aj,Aj_i) +i+en 

\j<T J n 
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(7.88) <¥(mm\Zj{h,)\<Xj] 

\j<r J n 

(7.89) < - + + - + 

n n 

from Lemma 7.5 and the fact that iJ,'j{h)/s'j{h) > 81ogn for all t < tjj. Now 



^)<P(A?„i) + 



3' 



(7.90) <^^AU) + [nAU) + - 



n 



and so, by induction, 



n 



r)t-l / 1 \ t-2 



n \n 

(7.91) 

<-+2*en = o(i) 

n 

since 2*^ri(c) = o(l) for sufficiently large c, for all t < ti. 

Step t = tjj. Fix < a < 2. We use the same argument as in the last 
case except that fij (h) / (h) < alogn for t = tu- Let x solve a = 4 — 2% — 
4-y/l — X- Then < x < 1 and V^logn — V a log n > ^2(1 — x) logn. By 
Lemma 7.5, 



"(Ctr,) <P max|Zj| > A,- 



(7.92) < rP(iV(0, 1) > J2{1 - x) logn) + ^„ 



Summarizing, 
(7.93) l-F\^Ct,n{ 1=0(1) 

which proves the theorem. □ 



Proof of Corollary 5.2. First note that, for any deterministic band- 
widths h* satisfying equations (5.2) and (5.3), we have that the squared 
(conditional) bias is given by 

(7.94a) Bias'^{mh*)= (Y^^J^f^ +op{tr{H'')) 

\j<r J 
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(7.94b) = AiAjhfhf + op{tiiH*)) 

i,j<r 

(7.94c) =Op(n-4/(^+'')+") 

by Theorem 5.1. Similarly, from Theorem 7.4 the (conditional) variance is 
(7.95a) Var(m,.) = ^(11)^) T^^^'^^ + 

(7.95b) =Op(n-^+^-/(^-+'^)+^) 
(7.95c) =Op(n-4/(4+'^)+^), 

where R{K) = J K{u)'^ du. Let h* denote the random bandwidths output 
from the algorithm. There exists sets Vn such that IP(V^) = o(l) and on 
the bandwidths satisfy equations (5.2) and (5.3). Let (5„ = n"^^/^^^''))'^^. It 
follows that 

¥{\fhh*{x)-m{x)\ >5n) 

(7.96) 

<n\fnh*{x)-ni{x)\>5n,Vn)+ny:)=o{i). □ 
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